A keyvowel approach to the synthesis of regional accents of English
Abstract
Most English text-to-speech synthesisers offer one of only two accents: General American or RP. Developing a new accent is laborious, since it is not possible to choose one accent as a base form and systematically translate to others. We use the approach of Wells ([1]), categorising vowels in terms of abstract keywords that encode classes of words. It is thus unnecessary to use a phonemic transcription in either the development or the execution of a synthesiser. The “keyvowel” system can be used throughout the synthesis system, avoiding the need to make accent-specific changes manually, and the same linguistic resources can be re-used for each new accent. More fundamentally, the keyvowel system functions as a meta-accent that subsumes vowel-related information in all accents of English.

1. THE NEED FOR REGIONAL ACCENTS IN ENGLISH SPEECH SYNTHESIS

A language may include several accents, differing not in their syntactic rules (as dialects do), but merely in their pronunciation rules. There are many accents of English, especially within the British Isles, but speech synthesisers have generally offered only General American or RP English. Most accents are mutually intelligible, but many users of synthesisers might prefer an accent closer to the one they are familiar with. This is especially true of vocally-impaired users, since the voice becomes their persona. The well-known British physicist Stephen Hawking began to use a synthesiser with an American accent, since that was the only kind available at the time. After a long period of using it, he now has no wish to change to a British synthesiser, since “[he] would feel [he] had become a different person” [2]. This comment illustrates how fundamental the synthesiser’s accent is to the user’s self-perception. There is a need for a wider range of accents, not only for disabled people, but also for publicity and presentation.
A Scottish bank offering synthetic speech telephone services would probably prefer a Scottish accent to an English one. In addition, the availability of accents would add variety and interest to consumer products that use synthesised speech.

2. REGIONAL ACCENTS AND SPEECH SYNTHESIS: THE PROBLEM

Given that the synthesis of different accents is desirable, the next question is which method is most effective. Various factors must be considered when selecting a method for synthesis.

2.1. Rule-based versus concatenative synthesis

Supporting different accents affects rule-based and concatenative speech synthesis differently. For rule-based synthesis, preparing a new accent requires detailed acoustic-phonetic and phonological knowledge of the accent, as well as an accent-specific phonetic lexicon and letter-to-sound (LTS) rules. Acquiring the acoustic knowledge requires much basic research into the acoustic characteristics of the accent before synthesis can even be attempted.

For concatenative synthesis, this detailed knowledge of acoustic characteristics is not necessary. The resources needed for each accent are: the phoneset (phoneme inventory), the pronunciation lexicon, LTS rules, and a textual representation of a database of recorded speech. Even where the units are derived from a large database of continuous speech, this textual transcript of the database is still required.

2.2. Types of linguistic variation between accents

For concatenative synthesis using existing methods, each accent requires a new phoneset and lexicon, as well as recordings and transcriptions of an accent-specific speech database. These are non-trivial tasks. If two accents differed only in the phonetic realisation of the same phonological system, there would be no difficulty, as the same phoneset, lexicon and text could be used.
Accents can differ more fundamentally than this, however, in the following ways (from [1]):

2.2.1. Differences in phonotactic distribution

Two accents use the same phonological system, but the phonemes occur in different syllabic contexts. For example, both RP and Scottish English have the /r/ phoneme. In Scottish English it appears in any consonantal position in the syllable, but in RP it appears only before the vowel (i.e. in the onset) and not in the coda.

2.2.2. Differences in the phonemic system

Two accents differ in the number or identity of phonemes: e.g., RP contains two low unrounded vowels, / / and / /, while Scottish English has only one, / /.

2.2.3. Differences in lexical distribution of phonemes

Two accents may differ only in the phonemes selected for particular words. Even where two accents use the same phoneme system (unlike 2.2.2) and the same phonotactic distribution in syllables (unlike 2.2.1), the phonemes do not always appear in the same words. For example, a typical northern English accent and RP both contain / / and / /, with identical syllabic distribution but different lexical distribution: the northern accent has / / and RP has / / in “hook”, “look”.

2.3. Methods of encoding linguistic variation

Since accents can differ in so many ways, existing methods of concatenative synthesis might use one of two approaches to develop a new accent of English:

2.3.1. “Brute force” approach

Develop an entirely new lexicon and set of LTS rules for each accent. This entails much time and effort, as well as detailed phonological and lexical knowledge. If adding a new accent is seen as desirable but not essential, then in commercial terms this approach may be judged not cost-effective.

2.3.2. “Base accent” approach

To simplify the process, develop a dictionary and set of LTS rules in a base accent (perhaps RP), and characterise each new accent's dictionary and LTS rules in terms of differences from this accent.
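As a rough illustration of the “base accent” approach in 2.3.2, the following minimal Python sketch derives a second accent's lexicon from a base lexicon plus a table of differences. Everything in it is an assumption made for illustration: the symbols `V1` and `V2` are placeholders for two vowel phonemes, the tiny word list and its transcriptions are invented, and the function names `derive_lexicon` and `substitute_everywhere` are hypothetical. It also shows why a difference in lexical distribution (2.2.3) cannot be expressed as a single phoneme-for-phoneme substitution and needs word-level entries instead.

```python
# Illustrative sketch of the "base accent" approach (section 2.3.2).
# V1 and V2 are placeholder symbols for two vowel phonemes; the word
# list and transcriptions are invented for illustration only.

BASE_LEXICON = {            # hypothetical base-accent (RP-like) entries
    "hook": ["h", "V1", "k"],
    "look": ["l", "V1", "k"],
    "put":  ["p", "V1", "t"],
    "pool": ["p", "V2", "l"],
}

# Differences of a hypothetical northern accent from the base accent,
# stored word by word.
NORTHERN_OVERRIDES = {
    "hook": ["h", "V2", "k"],
    "look": ["l", "V2", "k"],
}


def derive_lexicon(base, overrides):
    """Copy the base lexicon, replacing the entries listed in overrides."""
    derived = {word: list(phones) for word, phones in base.items()}
    for word, phones in overrides.items():
        derived[word] = list(phones)
    return derived


def substitute_everywhere(base, old, new):
    """Apply one blanket phoneme substitution to every entry."""
    return {
        word: [new if p == old else p for p in phones]
        for word, phones in base.items()
    }


northern = derive_lexicon(BASE_LEXICON, NORTHERN_OVERRIDES)
blanket = substitute_everywhere(BASE_LEXICON, "V1", "V2")

print(northern["hook"], northern["put"])
# The blanket rule also changes "put", so it does not reproduce the
# word-level result.
print(blanket == northern)
```

In this toy data the blanket substitution wrongly alters “put” as well, so the derived accent needs per-word entries, which is the lexical knowledge the approach was meant to avoid collecting.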
Although apparently easier, the second approach has a drawback: any accent chosen as the base will at some point fail to show a distinction that occurs in some other accent. There seems to be no single accent containing all possible phonemes and distinctions of English accents. For example, RP English differentiates certain vowels that are not distinguished in Scottish English (e.g. / / and / /) but lacks another distinction made in some Scottish accents (between the vowels of “tied” and “tide”). Whichever accent is chosen for the master lexicon, there will be some loss of information from the point of view of other accents, and so a simple translation from an existing accent is not possible.

3. SOLUTION: A KEYVOWEL SYSTEM

3.1. Wells’ keyword system for English

Wells ([1]) elaborates a system for classifying the vowel phonemes of English that allows for variation across accents. Instead of stating that the word “pool” contains the vowel [ ] in RP and the vowel [ ] in a Scottish accent, he states that it contains the GOOSE vowel, an abstract unit defined in terms of a class of words (e.g. loop, group, move, duke, sleuth) rather than in terms of a specific pronunciation. The GOOSE vowel is then given a separate phonetic definition for RP and for Scottish English. Other keywords are KIT, THOUGHT and CLOTH, with a total of 27 vowel keywords. The string CLOTH (etc.) is treated as a symbol representing a wide range of actual vowel phonemes in various accents. In any given accent, two or more keyword classes may be realised using the same vowel phoneme: in near-RP accents, for example, CLOTH and LOT words use the same vowel phoneme / /, whereas in General American the two word classes use / / and / / respectively.

3.2. Goodbye to phonemic transcription

This system avoids the need to re-specify all vowel phonemes for a different accent.
If all vowels (in the lexicon and LTS rules) are specified in terms of keywords (hence “keyvowels”), then exactly the same lexicon can be used for all accents. Given a concatenative synthesis system, there is not even any need for a set of realisation rules giving the phonemes for each accent. The same text representation of isolated words can be used for all accents, and it is not necessary to research detailed acoustic-phonetic knowledge of the vowels of the different accents. The important point is that this cuts out the use of phonemic transcription altogether.

In text-to-speech synthesis, there are two stages in generating speech:

a) From orthographic form to phonemic transcription.
b) From phonemic transcription to sequence of speech units.

Using conventional methods of concatenative synthesis, both stages require extensive re-engineering when developing a new accent. The “keyvowel” method has two significant advantages over conventional methods:

3.2.1. Single stage during synthesis

There is only one stage. The system converts from the orthographic form directly to speech units specified in terms of keyvowels, with no intermediate phonemic transcription. Instead of grapheme-to-phoneme rules, there is a set of grapheme-to-keyvowel rules, for use in the rare cases where an input word is not found in the dictionary. The recorded database of speech units is specified in terms of the keyvowels and so can be accessed directly using them.

3.2.2. Maximal re-use of linguistic resources

Re-engineering this single stage for a new accent requires no modification of the linguistic resources used by the system, merely the processing of a new voice. The recording subject is given a script of “real words” and hence automatically provides the appropriate realisation of each keyvowel in the given accent.

4. KEYVOWEL-BASED DICTIONARY

A draft keyvowel dictionary has been produced, with 47781 entries.
Each entry in this dictionary has three parts: an index number, the (lower-case) orthographic form, and a pronunciation string. The vowel symbols in the pronunciation string represent keyvowels rather than actual phonemes of any particular accent.
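To make the three-part entry and the accent-independent lookup more concrete, here is a minimal Python sketch. It is not the dictionary's actual file format, which the text does not specify: the tab-separated layout, the space-separated pronunciation symbols, the upper-case spelling of keyvowels inside the string, the `Entry` class, the functions `parse_entry` and `select_units`, and the unit names such as `sc_GOOSE` are all hypothetical. Selecting one unit per symbol is also a simplification of how a concatenative system picks units. The only details taken from the text are the three fields and the fact that “pool” contains the GOOSE vowel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    index: int    # index number
    orth: str     # lower-case orthographic form
    pron: tuple   # pronunciation string, split into symbols


def parse_entry(line):
    """Parse one dictionary line; the tab-separated layout is assumed."""
    index, orth, pron = line.rstrip("\n").split("\t")
    return Entry(int(index), orth, tuple(pron.split()))


def select_units(entry, voice_units):
    """Look up one recorded unit per symbol, directly by keyvowel name."""
    return [voice_units[symbol] for symbol in entry.pron]


# Hypothetical unit inventories for two recorded voices. Both are
# indexed by the same symbols, so no per-accent phoneme table is needed.
SCOTTISH_VOICE = {"p": "sc_p", "l": "sc_l", "GOOSE": "sc_GOOSE"}
RP_VOICE = {"p": "rp_p", "l": "rp_l", "GOOSE": "rp_GOOSE"}

entry = parse_entry("1\tpool\tp GOOSE l")
print(select_units(entry, SCOTTISH_VOICE))
print(select_units(entry, RP_VOICE))
```

The same parsed entry drives both voices; only the recorded unit inventory changes, which mirrors the re-use of linguistic resources described in 3.2.2.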